An introduction to resampling methods
Reference: Data Science Chapter 5
From the New England Journal of Medicine in 2006:
We randomly assigned patients with resectable adenocarcinoma of the stomach, esophagogastric junction, or lower esophagus to either perioperative chemotherapy and surgery (250 patients) or surgery alone (253 patients)…. With a median follow-up of four years, 149 patients in the perioperative-chemotherapy group and 170 in the surgery group had died. As compared with the surgery group, the perioperative-chemotherapy group had a higher likelihood of overall survival (five-year survival rate, 36 percent vs. 23 percent).
Conclusion: perioperative chemotherapy improves overall survival in these patients.
Not so fast! In statistics, we ask “what if?” a lot:
Always remember two basic facts about samples:
By “quantifying uncertainty,” we mean filling in the blanks.
In stats, we equate trustworthiness with stability:
\[ \begin{array}{r} \mbox{Confidence in} \\ \mbox{your estimates} \\ \end{array} \iff \begin{array}{l} \mbox{Stability of those estimates} \\ \mbox{under the influence of chance} \\ \end{array} \]
For example:
Let's work through a thought experiment…
Imagine Andrey Kolmogorov on a four-day fishing trip.
(Figure: the sampling distributions of both \( \beta_0 \) and \( \beta_1 \).)
Suppose we are trying to estimate some population-level quantity \( \theta \): the parameter of interest.
So we take a sample from the population: \( X_1, X_2, \ldots, X_N \).
We use the data to form an estimate \( \hat{\theta}_N \) of the parameter.
Now imagine repeating this process thousands of times!
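This repetition can be simulated directly. Below is a minimal sketch, assuming (hypothetically) an exponential population with true mean \( \theta = 2 \) and the sample mean as the estimator; the collection of estimates across repetitions approximates the sampling distribution.

```python
# Hypothetical simulation: repeat the sampling process thousands of times.
# Assumptions for the demo: exponential population, true mean theta = 2,
# estimator = sample mean. None of these come from the notes themselves.
import numpy as np

rng = np.random.default_rng(42)
theta = 2.0      # true population mean (the parameter of interest)
N = 100          # sample size
reps = 10_000    # number of repeated samples

# Each row is one sample of size N; each row's mean is one estimate.
samples = rng.exponential(scale=theta, size=(reps, N))
theta_hat = samples.mean(axis=1)   # draws from the sampling distribution

print(theta_hat.mean())  # close to theta (the sample mean is unbiased)
print(theta_hat.std())   # close to theta / sqrt(N) = 0.2
```

A histogram of `theta_hat` would show the sampling distribution directly: centered at the truth, with spread shrinking as \( N \) grows.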
Estimator: any method for estimating the value of a parameter (e.g. sample mean, sample proportion, slope of OLS line, etc).
Sampling distribution: the probability distribution of an estimator \( \hat{\theta}_N \) under repeated samples of size \( N \).
Bias: Let \( \bar{\theta}_N = E(\hat{\theta}_N) \) be the mean of the sampling distribution. The bias of \( \hat{\theta}_N \) is \( (\bar{\theta}_N - \theta) \): the difference between the average answer and the truth.
Unbiased estimator: \( (\bar{\theta}_N - \theta) = 0 \).
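The classic textbook illustration of bias (not taken from these notes, but standard) is estimating a variance: dividing by \( N \) gives a biased estimator, while dividing by \( N - 1 \) gives an unbiased one. A quick simulation makes the bias visible:

```python
# Hypothetical illustration of bias: two estimators of the population variance.
# Assumptions for the demo: normal population with true variance sigma2 = 4.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0     # true population variance
N = 10
reps = 50_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, N))
var_biased = samples.var(axis=1, ddof=0)     # divide by N
var_unbiased = samples.var(axis=1, ddof=1)   # divide by N - 1

# Bias = E[theta_hat] - theta, approximated by the Monte Carlo average.
print(var_biased.mean() - sigma2)    # roughly -sigma2 / N = -0.4
print(var_unbiased.mean() - sigma2)  # roughly 0
```

The divide-by-\( N \) estimator's average answer sits systematically below the truth; the \( N - 1 \) version's average answer matches it.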
Standard error: the standard deviation of an estimator's sampling distribution.
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ \mbox{var}(\hat{\theta}_N) } \\ &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from its average} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how much does the answer vary, on average?”
If an estimator is unbiased, then
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \sqrt{ E[ (\hat{\theta}_N - \theta )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from the truth} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how big of an error do I make, on average?”
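The equivalence above, that for an unbiased estimator the standard error equals the typical deviation from the truth, can be checked numerically. A minimal sketch, assuming (hypothetically) a sample proportion of Bernoulli draws with true \( \theta = 0.3 \):

```python
# Hypothetical check: for an unbiased estimator, the standard error
# (spread around its own average) matches the root-mean-squared
# deviation from the truth. Assumed setup: sample proportion of
# N Bernoulli(theta) draws with theta = 0.3.
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3      # true population proportion
N = 200
reps = 20_000

# One binomial count per repetition, divided by N = the sample proportion.
theta_hat = rng.binomial(N, theta, size=reps) / N

se = theta_hat.std()                               # deviation from its own mean
rmse = np.sqrt(np.mean((theta_hat - theta) ** 2))  # deviation from the truth

print(se, rmse)  # nearly identical; both near sqrt(theta*(1-theta)/N)
```

Because the sample proportion is unbiased, the two numbers agree up to simulation noise; for a biased estimator, the RMSE would exceed the standard error.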